MPPrioritizedReplayBuffer

class cpprb.MPPrioritizedReplayBuffer(size, env_dict=None, alpha=0.6, *, eps=0.0001, **kwargs)

Bases: cpprb.PyReplayBuffer.MPReplayBuffer

Multi-process Prioritized Replay Buffer class to store transitions with priorities.

This class can be used from multiple processes without manual locking.

Transitions are sampled with probabilities proportional to their priorities.

Notes

This class assumes single learner (sample, update_priorities) and multiple explorers (add).

Methods Summary

add(self, *[, priorities])

Add transition(s) into replay buffer.

clear(self)

Clear replay buffer

get_max_priority(self)

Get the max priority of stored priorities

on_episode_end(self)

Call on episode end

sample(self, batch_size[, beta])

Sample the stored transitions.

update_priorities(self, indexes, priorities)

Update priorities

Methods Documentation

add(self, *, priorities=None, **kwargs)

Add transition(s) into replay buffer.

Multiple sets of transitions can be added simultaneously. This method can be called from multiple explorer processes without manual locking.

Parameters
  • priorities (array like or float, optional) – Priorities of each transition. When no priorities are passed, the maximum priority observed so far is used.

  • **kwargs (array like or float or int) – Transitions to be stored.

Returns

The first index of the stored position. If all transitions are stored into the NstepBuffer and no transitions are stored into the main buffer, None is returned.

Return type

int or None

Raises

KeyError – If any values defined at constructor are missing.

Warning

All values must be passed as keyword arguments. It is the user's responsibility to ensure that all values have the same step size.

clear(self) → void

Clear replay buffer

get_max_priority(self) → float

Get the max priority of stored priorities

Returns

max_priority – the max priority of stored priorities

Return type

float

on_episode_end(self) → void

Call on episode end

Notes

Calling this function at episode end is the user's responsibility, since exploration can be truncated at a certain episode length even when the environment has not set any done flag.

sample(self, batch_size, beta=0.4)

Sample the stored transitions.

Transitions are sampled according to their priorities, with the specified batch size. This method can be called from the single learner process.

Parameters
  • batch_size (int) – Sampled batch size

  • beta (float, optional) – Exponent controlling how strongly the importance-sampling correction is applied; the default value is 0.4

Returns

sample – A batch of samples of the requested size, which also includes ‘weights’ and ‘indexes’

Return type

dict of ndarray

Notes

When ‘beta’ is 0, the weights become uniform. When ‘beta’ is 1, the weights become the usual importance-sampling weights. The ‘weights’ are also normalized by the weight for the minimum priority (\(= w_{i}/\max_{j}(w_{j})\)), which ensures the weights are \(\leq\) 1.
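The weight computation described above can be reproduced with plain NumPy. This is a sketch of the formula, not cpprb's internal code; the hypothetical `is_weights` helper takes the already-exponentiated priorities \((p_i + \epsilon)^{\alpha}\) as input:

```python
import numpy as np

def is_weights(priorities, sampled_idx, beta=0.4):
    """Importance-sampling weights, normalized so that the max weight is 1."""
    priorities = np.asarray(priorities, dtype=np.float64)
    probs = priorities / priorities.sum()   # P(i), proportional sampling probability
    n = priorities.size
    w = (n * probs) ** (-beta)              # w_i = (N * P(i))^(-beta)
    w /= w.max()                            # the max weight belongs to the minimum priority
    return w[sampled_idx]
```

With `beta=0` every weight is 1 (uniform); with `beta=1` the weights fully compensate for the non-uniform sampling.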

update_priorities(self, indexes, priorities)

Update priorities

Update the priorities at the specified indices. Indices whose transitions have been overwritten since the last call to the sample() method are ignored. This method can be called from the single learner process.

Parameters
  • indexes (array_like) – indexes to update priorities

  • priorities (array_like) – priorities to update

Raises

TypeError – When indexes or priorities is None

__init__()

Initialize MPPrioritizedReplayBuffer

Parameters
  • size (int) – buffer size

  • env_dict (dict of dict, optional) – Dictionary specifying environments. The keys of env_dict become environment names. The values of env_dict, which are also dicts, define “shape” (default 1) and “dtype” (falling back to default_dtype)

  • alpha (float, optional) – \(\alpha\), the exponent applied to the stored priorities; the default value is 0.6

  • eps (float, optional) – \(\epsilon\), a small positive constant ensuring that transitions with zero error can still be sampled; the default value is 1e-4.

See also

ReplayBuffer

Any optional parameters at ReplayBuffer are valid, too.

Notes

The minimum and the partial sums of the pre-calculated priorities \((p_{i} + \epsilon )^{ \alpha }\) are stored in segment trees, which enables fast sampling.
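A toy illustration of the segment-tree idea: a pure-Python sum tree for proportional sampling. This is a sketch of the data structure, not cpprb's implementation, which is written in Cython and also maintains a min tree for the weight normalization:

```python
import random

class SumTree:
    """Binary sum tree: leaf i holds its priority (e.g. (p_i + eps)**alpha);
    each internal node holds the sum of its two children."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # nodes 1..cap-1 internal, cap..2cap-1 leaves

    def update(self, i, value):
        """Set leaf i and propagate the change up to the root in O(log n)."""
        i += self.capacity
        self.tree[i] = value
        while i > 1:
            i //= 2
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self):
        """Draw a leaf index with probability proportional to its value, O(log n)."""
        mass = random.random() * self.tree[1]   # tree[1] is the total sum
        i = 1
        while i < self.capacity:                # descend by cumulative mass
            left = 2 * i
            if mass < self.tree[left]:
                i = left
            else:
                mass -= self.tree[left]
                i = left + 1
        return i - self.capacity
```

Both updating a priority and drawing a sample cost O(log n), instead of the O(n) a flat cumulative-sum scan would require.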

get_all_transitions(self, shuffle: bool = False)

Get all transitions stored in replay buffer.

Parameters

shuffle (bool, optional) – When True, transitions are shuffled. The default value is False.

Returns

transitions – All transitions stored in this replay buffer.

Return type

dict of numpy.ndarray

get_buffer_size(self) → size_t

Get buffer size

Returns

buffer size

Return type

size_t

get_next_index(self) → size_t

Get the next index to store

Returns

the next index to store

Return type

size_t

get_stored_size(self) → size_t

Get stored size

Returns

stored size

Return type

size_t

is_Nstep(self) → bool

Get whether the Nstep feature is used

Returns

use_nstep

Return type

bool
